A new look on the genotype-by-environment interaction: enviromics and probabilistic models

8th International Meeting on Plant Breeding

1 Introduction

This guide will help you using ProbBreed’s functions, from fitting Bayesian models to computing probabilities. If you need more theoretical details, check the papers Dias et al. (2022) and Chaves et al. (2024). The vignette and the help page of each function (for e.g., help(prob_sup)) have more details on the functions arguments. Also, make sure you have package installed. For the CRAN (stable) version:

install.packages("ProbBreed")

For installing the development version (everything we do before submitting to CRAN), use:

# install.packages("devtools")
devtools::install_github("saulo-chaves/ProbBreed")

After installing, load the package:

library(ProbBreed)

2 Available datasets

The package contains two dataset that can be used as examples to run the functions: soy and maize. soy belongs to the USDA Northern Region Uniform Soybean Tests, and it is a subset of the data used by Krause et al. (2023). It contains the empirical best linear unbiased estimates of genotypic means of the seed yield from 39 experimental genotypes evaluated in 14 locations.

head(soy)
  Loc Gen        Y
1 E01 G01 49.89367
2 E01 G02 48.51867
3 E01 G03 49.81867
4 E01 G04 47.31867
5 E01 G05 47.91867
6 E01 G06 53.01867

maize belongs to value of cultivation and use maize trials of Embrapa Maize and Sorghum, and was used by Dias et al. (2022). It contains the grain yield of 32 single-cross hybrids and four commercial checks (36 genotypes in total) evaluated in 16 locations across five regions or mega-environments. These trials were laid out in incomplete blocks design, using a block size of 6 and two replications per trial

head(maize)
  Region Location Rep Block Hybrid   GY
1     TB      E10   1     3     G9 9.65
2     TB      E10   1     6    G20 6.72
3     TB      E10   1     5    G34 6.67
4     TB      E10   1     4    G18 7.69
5     TB      E10   1     6    G30 8.99
6     TB      E10   1     5    G27 9.53

We will begin our tutorial with the soy dataset, and end with the maize dataset. We will explain the reason for this in a pinch.

3 First step: fitting the Bayesian model

The first step is to fit a multi-environment Bayesian model using the bayes_met function. This function has a set of predefined models that are used to run Bayesian analyses using rstan. Currently, the function has nine options of models (Figure 1).

Figure 1: Options to declare replications and/or blocks (repl), years (year) and regions (reg) effects in the bayes_met function. Users must substitute Repl, Block, Year, and Region by the name of the column that contains the information of replicates, blocks (if applicable), year (if available) and region (if available). RCDB and IBD are acronyms for randomized complete block design and incomplete block design, respectively. A model using only adjusted means can only be fitted with reg = NULL and year = NULL

These models differ according to the considered information: only locations, locations and years, locations and breeding regions, and locations, years and breeding regions. Should you consider an “environment” as a combination of environmental factors (for instance, location and planting dates within location), this information must be in the loc argument. All models will have gen and loc. If you want to consider the effect of regions (or mega-environment) and/or time factors (for instance, years or harvests), then reg and year must represent the column that contains the information on these effects. Otherwise, these arguments are set to NULL. You may control the experimental design effects in the repl argument. If repl = NULL, bayes_met will assume that you are entering with adjusted means of each environment, i.e., data without replicates. If you have data from trials laid out in randomized complete blocks, repl = 'Replicate'. Finally, if the multi-environment trials were laid out in incomplete blocks design, repl = c('Replicate', 'Block'). By default, when repl = NULL, reg and year must also be NULL.

The Bayesian models implemented in ProbBreed have predefined priors described by Dias et al. (2022). In summary: \[ x \sim \mathcal{N}\left(0,S^{[x]}\right) \]

\[ \sigma \sim \mathcal{HalfCauchy}\left(0, S^{[\sigma]}\right) \]

where \(x\) can be any effect but the residual, and \(\sigma\) is the standard deviation of the likelihood. If an heterogeneous model is set (res.het = TRUE), then \(\sigma_k \sim HalfCauchy\left(0, S^{[\sigma_k]}\right)\).

The hyperpriors are set as follows:

\[ S^{[x]} \sim HalfCauchy(0, \phi) \]

where \(\phi\) is the known global hyperparameter defined such as \(\phi = max(y) \times 10\). By doing this, we restrain the hyperparameter to be finite and allow the model to take full advantage of the data to infer the posterior distribution since it provide weakly informative prior distributions, avoiding subjectivity that can possibly hamper the results. The half-Cauchy distribution was used to guarantee that variance components will not be negative.

Let’s check how can we fit the Bayesian models for the soy and maize datasets:

  • soy (we will run this)
mod_soy = bayes_met(data = soy,
                gen = "Gen",
                loc = "Loc", 
                repl = NULL,
                trait = "Y",
                res.het = TRUE,
                iter = 2000, 
                cores = 4, 
                chains = 4)
  • maize (homework: run this latter)
mod_maize = bayes_met(data = maize,
                gen = "Hybrid",
                loc = "Location",
                repl = c("Rep","Block"),
                trait = "GY",
                reg = "Region",
                year = NULL,
                res.het = TRUE,
                iter = 2000, 
                cores = 4,
                chain = 4)

You may change the number of iterations, chains, and cores. This will vary according to the data and the processing capacity of your machine. Be aware that the more iterations and chains, the longer the computation time. At the same time, results are likely more reliable, and mixing/convergence issues will diminish. By default, if more than one core is set, the function runs one chain per core. The number of cores depends on the processing capacity of your machine. Choose wisely.

ProbBreed uses rstan to fit Bayesian models, so packages and functions designed to manage and explore these models can be used in the output object of the bayes_met function. Some interesting options are rstantools, shinystan, and bayesplot.

Check out the shinystan package:

library(shinystan)
shinystan::launch_shinystan(mod)

4 Second step: explore the model’s outputs

After fitting the model, the next step is to extract outputs using extr_outs. This function also provides other useful information, like variance components and posterior predictive checks. Using the mod_soy object from the previous step:

outs = extr_outs(model = mod_soy,
                 probs = c(0.05, 0.95), 
                 verbose = TRUE)

where:

  • model is the model fitted using bayes_met

  • probs are the probabilities considered to calculate the quantiles and the highest posterior density (HPD)

This function provides an object of class extr, which is a list with the posterior distribution of each effect, the data generated by the model, the maximum posterior values of each effect, the variances of each effect, and quality parameters of the model (see below).

outs$variances
        effect     var      sd naive.se HPD_0.05 HPD_0.95
1          Gen   3.373   1.308    0.021    1.696    5.814
2          Loc 246.870 125.446    1.983  117.417  475.870
3   error_env1  10.069   2.670    0.042    6.471   15.045
4   error_env2  29.046   7.055    0.112   19.479   42.455
5   error_env3  11.561   3.313    0.052    7.252   17.680
6   error_env4  18.536   4.907    0.078   11.982   27.364
7   error_env5  51.325  12.414    0.196   34.575   73.848
8   error_env6  15.730   4.132    0.065   10.276   23.350
9   error_env7  19.356   4.929    0.078   12.758   28.475
10  error_env8  21.170   5.503    0.087   13.824   31.311
11  error_env9  14.878   3.809    0.060    9.673   21.735
12 error_env10  13.528   5.673    0.090    6.863   24.331
13 error_env11  22.291   9.237    0.146   11.756   39.050
14 error_env12   7.450   3.236    0.051    3.610   13.345
15 error_env13  14.689   5.942    0.094    7.623   25.672
16 error_env14  10.986   4.363    0.069    5.690   19.079
outs$ppcheck
                  Diagnostics
p.val_max              0.9170
p.val_min              0.3608
p.val_median           0.7418
p.val_mean             0.5130
p.val_sd               0.5540
Eff_No_parameters     27.3972
WAIC2               2715.5314
mean_Rhat              1.0011
Eff_sample_size        0.6399

The S3 method plot will generate some useful plots, like the density plots and histograms built from the posterior effects (Figure 2). It can also build trace plots.

plot(outs, category = "density")
plot(outs, category = "histogram")
(a) Density plots
(b) Histograms
Figure 2: Posterior effects’ distribution as depicted by density plots and histograms

A particularly useful plot is the comparison between the empirical and sampled phenotype, which illustrates the model’s convergence (Figure 3). The more the density plots overlap, the more successful was the model in sampling the effects.

plot(outs)
Figure 3: Density plots comparing the distributions of empirical and sampled phenotypes. The more overlapped, the better.

5 Third step: compute the probabilities

With the outputs extracted by extr_outs, we can calculate the probabilities of superior performance and superior stability of the evaluated genotypes with prob_sup:

results = prob_sup(extr = outs, 
                   int = .2,
                   increase = TRUE, 
                   save.df = FALSE, 
                   verbose = FALSE)

where:

  • increase: The objective is for increasing (TRUE, default) or decreasing (FALSE) the trait mean

  • extr: An extr object that contains the outputs of the extr_outs() function.

  • int: The selection intensity, expressed in decimal values. In the example, the selection intensity is 20%.

  • save.df: TRUE if you want to save the data frames containing the calculated probabilities in the working directory, FALSE otherwise.

This function generates an object of class probsup, which consists of two lists: across and within. The first list contains the probabilities of superior performance and superior stability across environments, while the second contains the probabilities of superior performance within environments. probsup is also compatible with the S3 method plot. See the details below or in help("plot.probsup")

Important

Within-environment probabilities of superior performance, and probabilities of superior stability are unavailable for entry-mean models (i.e., the one we used for the soy data), since we have no explicit genotype \(\times\) environment interaction effect in this case.

Let’s check the available outputs for soy:

5.1 HPD of the genotypic effects

head(results$across$g_hpd)
  gen          g      HPD95   HPD97.5       HPD5     HPD7.5
1 G01  1.3762307  2.9176431 3.2627871 -0.1692294 -0.4855525
2 G02 -1.4201714  0.4352518 0.8312823 -3.3015241 -3.6673793
3 G03  0.3276901  2.1634450 2.4905701 -1.5210447 -1.8249005
4 G04 -1.0515739  0.7217608 1.1063974 -2.9132463 -3.2910979
5 G05 -2.0253358 -0.1852200 0.1379327 -4.0239070 -4.3471062
6 G06  0.3947565  2.1778044 2.5478518 -1.4134656 -1.7735013
plot(results, category = "hpd")
Figure 4: Caterpillar plot represeting the Highest posterior density (HPD) of the posterior genotypic main effects

5.2 Probability of superior performance

\[ Pr\left(\hat{g}_j \in \Omega \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_j^{(s)} \in \Omega \vert y\right) \]

head(results$across$perfo)
    ID    prob
36 G36 0.98175
9  G09 0.84125
20 G20 0.82800
38 G38 0.72025
31 G31 0.66900
1  G01 0.51575
plot(results)
Figure 5: Lollipop plot with the probabilities of superior performance of selection candidates

5.3 Pairwise probability of superior performance

\[ Pr\left(\hat{g}_{j} > \hat{g}_{j^\prime} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{j}^{(s)} > \hat{g}_{j^\prime}^{(s)} \vert y\right) \]

This equation is adequate for when we want to increase the trait value. If we set the argument increase = FALSE, then \(>\) is changed to \(<\) in the equation above.

results$across$pair_perfo[1:5, 1:5]
        G01     G02     G03    G04     G05
G01 0.00000 0.02475 0.23725 0.0445 0.00725
G02 0.97525 0.00000 0.86450 0.5985 0.34250
G03 0.76275 0.13550 0.00000 0.1815 0.06350
G04 0.95550 0.40150 0.81850 0.0000 0.25700
G05 0.99275 0.65750 0.93650 0.7430 0.00000
plot(results, category = "pair_perfo")
Figure 6: Heatmap with the pairwise probabilities of superior performance. The heatmap depicts the probability of genotypes in the x-axis being superior to those in the y-axis.

This is as far as we can go with the soy data. Now, let’s shift to the maize data. To speed things up, let’s load the object containing the probabilities:

probs_maize = readRDS(file = "probs_maize.rds")

5.4 Probability of superior stability

\[ Pr\left[var\left(\widehat{ge}_{jk}\right) \in \Omega \vert y\right] = \frac{1}{S} \sum_{s=1}^S I\left[var\left(\widehat{ge}_{jk}^{(s)}\right) \in \Omega \vert y\right] \]

Note that this probability can only be computed across environments since it depends on \(var\left(\widehat{ge}_{jk}\right)\).

head(probs_maize$across$stabi$gl)
    ID    prob
8  G16 0.52225
25 G31 0.46600
28 G34 0.40550
4  G12 0.39975
24 G30 0.37625
29 G35 0.37100
head(probs_maize$across$stabi$gm)
    ID    prob
36  G9 0.27375
32  G5 0.26775
28 G34 0.26675
18 G25 0.26075
24 G30 0.26050
26 G32 0.25900
plot(probs_maize, category = "stabi")
Figure 7: Lollipop plot with the probabilities of superior stability (per location and region) of selection candidates

5.5 Pairwise probability of superior stability

\[ Pr\left[var\left(\widehat{ge_{jk}}\right) < var\left(\widehat{ge}_{ik}\right) \vert y\right] = \frac{1}{S} \sum_{s=1}^S{I \left[var\left(\widehat{ge}_{jk}^{(s)}\right) < var\left(\widehat{ge}_{ik}^{(s)}\right) \vert y\right]} \]

Note that, in this case, we are interested in the genotype that has a lower variance of the genotype-by-environment (or genotype-by-region) interaction effects.

probs_maize$across$pair_stabi$gl[1:5, 1:5]
         G1     G10     G11     G12    G13
G1  0.00000 0.74525 0.75875 0.88250 0.7850
G10 0.25475 0.00000 0.52025 0.70625 0.5495
G11 0.24125 0.47975 0.00000 0.69650 0.5410
G12 0.11750 0.29375 0.30350 0.00000 0.3470
G13 0.21500 0.45050 0.45900 0.65300 0.0000
probs_maize$across$pair_stabi$gm[1:5, 1:5]
         G1     G10     G11     G12     G13
G1  0.00000 0.56225 0.58550 0.56725 0.57525
G10 0.43775 0.00000 0.52250 0.51900 0.52425
G11 0.41450 0.47750 0.00000 0.49650 0.49525
G12 0.43275 0.48100 0.50350 0.00000 0.49850
G13 0.42475 0.47575 0.50475 0.50150 0.00000
plot(probs_maize, category = "pair_stabi")
Figure 8: Heatmap with the pairwise probabilities of superior stability. The heatmap depicts the probability of genotypes in the x-axis having lower GEI variance than those in the y-axis.

5.6 Joint probability of superior performance and superior stability

\[ Pr\left[\hat{g}_j \in \Omega \cap var\left(\widehat{ge}_{jk}\right) \in \Omega\right] = Pr\left(\hat{g}_j \in \Omega\right) \times Pr\left[var\left(\widehat{ge}_{jk}\right) \in \Omega\right] \]

head(probs_maize$across$joint)
     ID    level category        prob
145  G1 Location    Joint 0.026392500
146 G10 Location    Joint 0.000000000
147 G11 Location    Joint 0.006300000
148 G12 Location    Joint 0.151805062
149 G13 Location    Joint 0.133568750
150 G14 Location    Joint 0.001108125
plot(probs_maize, category = "joint")
Figure 9: Lollipop plot with the joint probability of superior performance and stability of selection candidates.

5.7 Probabilities within environments

When GEI effects are estimated (in all situation but the entry-mean model), we can calculate probabilities of superior performance within individual environments:

5.7.1 Probability of superior performance

Instead of using the marginal genotypic effect of each genotype, we use the within environment genotypic effect \(\left(g_{jk} = g_j + ge_{jk}\right)\):

\[ Pr\left(\hat{g}_{jk} \in \Omega_k \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} \in \Omega_k \vert y\right) \]

head(probs_maize$within$perfo$gl)
  gen      E1     E10     E11     E12     E13     E14     E15     E16      E2
1  G1 0.98325 0.26325 0.38075 0.44300 0.27800 0.94925 0.32450 0.97100 0.89000
2 G10 0.01125 0.00025 0.00075 0.00300 0.02125 0.07200 0.00000 0.04075 0.10100
3 G11 0.00550 0.09850 0.11325 0.00375 0.61600 0.10925 0.75000 0.09850 0.01850
4 G12 0.15950 0.08575 0.04500 0.06200 0.06050 0.30525 0.16000 0.56775 0.42625
5 G13 0.20900 0.18625 0.13600 0.13175 0.06300 0.58950 0.81750 0.06975 0.22175
6 G14 0.00000 0.00450 0.10925 0.16250 0.02975 0.06000 0.00075 0.02325 0.00225
       E3      E4      E5      E6      E7      E8      E9
1 0.63450 0.91950 0.52275 0.30625 0.00350 0.85675 0.37450
2 0.20125 0.00200 0.04025 0.01875 0.00125 0.00050 0.04500
3 0.03050 0.02025 0.11150 0.33350 0.01575 0.04425 0.23275
4 0.59875 0.18200 0.37100 0.45225 0.81950 0.15950 0.30325
5 0.70250 0.69550 0.60050 0.82175 0.46250 0.11825 0.53675
6 0.04975 0.13150 0.11350 0.62225 0.02850 0.03475 0.07025
plot(probs_maize, category = "perfo", level = "within")
Figure 10: Heatmap with the within-environment (location or region) probabilities of superior performance of the selection candidates

5.7.2 Pairwise probability of superior performance

\[ Pr\left(\hat{g}_{jk} > \hat{g}_{j^\prime k} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} > \hat{g}_{j^\prime k}^{(s)} \vert y\right) \]

Or \(<\) instead of \(>\) if increase = TRUE

The comparison are made per environment, so it impossible to illustrate them all in a single plot. plot.probsup will ask you if you are storing the figures in an object. If not, the function won’t build the plots. Long story short: store the outcomes of plot.probsup in an object in the case of within-environment pairwise probability of superior performance.

head(probs_maize$within$perfo$gl)
pairsupwithin = plot(probs_maize, category = "pair_perfo", level = "within")
pairsupwithin$gl$E16
pairsupwithin$gm$NE
  gen      E1     E10     E11     E12     E13     E14     E15     E16      E2
1  G1 0.98325 0.26325 0.38075 0.44300 0.27800 0.94925 0.32450 0.97100 0.89000
2 G10 0.01125 0.00025 0.00075 0.00300 0.02125 0.07200 0.00000 0.04075 0.10100
3 G11 0.00550 0.09850 0.11325 0.00375 0.61600 0.10925 0.75000 0.09850 0.01850
4 G12 0.15950 0.08575 0.04500 0.06200 0.06050 0.30525 0.16000 0.56775 0.42625
5 G13 0.20900 0.18625 0.13600 0.13175 0.06300 0.58950 0.81750 0.06975 0.22175
6 G14 0.00000 0.00450 0.10925 0.16250 0.02975 0.06000 0.00075 0.02325 0.00225
       E3      E4      E5      E6      E7      E8      E9
1 0.63450 0.91950 0.52275 0.30625 0.00350 0.85675 0.37450
2 0.20125 0.00200 0.04025 0.01875 0.00125 0.00050 0.04500
3 0.03050 0.02025 0.11150 0.33350 0.01575 0.04425 0.23275
4 0.59875 0.18200 0.37100 0.45225 0.81950 0.15950 0.30325
5 0.70250 0.69550 0.60050 0.82175 0.46250 0.11825 0.53675
6 0.04975 0.13150 0.11350 0.62225 0.02850 0.03475 0.07025
Are you using an object to store the outputs of this function? Enter Y/n: 
(a) Location E16
(b) Region NE
Figure 11: Heatmap with the within-environment pairwise probabilities of superior performance

Like the other heatmaps, we are evaluating the probability of genotypes at x-axis being superior than genotypes at the y-axis.

6 Wrap up

The estimated probabilities demonstrated in this tutorial from ProbBreed are related to some key question that constantly arises in plant breeding:

  • What is the risk of recommending a selection candidate?

  • What is the probability of a given selection candidate having good performance across or within environments?

  • What is the probability of a selection candidate having better performance than a check cultivar check?

  • How probable is it that a given selection candidate performs similarly across environments?

  • What are the chances that a given selection candidate is more stable than a cultivar check?

  • What is the probability that a given selection candidate having a superior and invariable performance across environments?

If you used it for a scientific publication, please, cite it:

citation("ProbBreed")
Thank you for citing ProbBreed in your publication! Please, use:

  Chaves SFS, Krause MD, Dias LAS, Garcia AAF, Dias KOG (2024).
  "ProbBreed: a novel tool for calculating the risk of cultivar
  recommendation in multienvironment trials." _G3
  Genes|Genomes|Genetics_, *14*(3), jkae013.
  doi:10.1093/g3journal/jkae013
  <https://doi.org/10.1093/g3journal/jkae013>.

Uma entrada BibTeX para usuários(as) de LaTeX é

  @Article{ProbBreed_paper,
    title = {ProbBreed: a novel tool for calculating the risk of cultivar recommendation in multienvironment trials},
    author = {Saulo F. S. Chaves and Matheus D. Krause and Luiz A. S. Dias and Antonio A. F. Garcia and Kaio O. G. Dias},
    journal = {G3 Genes|Genomes|Genetics},
    year = {2024},
    volume = {14},
    number = {3},
    pages = {jkae013},
    doi = {10.1093/g3journal/jkae013},
  }

7 References

Chaves, S. F. S., Krause, M. D., Dias, L. A. S., Garcia, A. A. F., & Dias, K. O. G. (2024). ProbBreed: A novel tool for calculating the risk of cultivar recommendation in multienvironment trials. G3 GenesGenomesGenetics, 14(3), jkae013. https://doi.org/10.1093/g3journal/jkae013
Dias, K. O. G., Santos, J. P. R., Krause, M. D., Piepho, H.-P., Guimarães, L. J. M., Pastina, M. M., & Garcia, A. A. F. (2022). Leveraging probability concepts for cultivar recommendation in multi-environment trials. Theoretical and Applied Genetics, 135(4), 1385–1399. https://doi.org/10.1007/s00122-022-04041-y
Krause, M. D., Dias, K. O. G., Singh, A. K., & Beavis, W. D. (2023). Using soybean historical field trial data to study genotype by environment variation and identify mega-environments with the integration of genetic and non-genetic factors. bioRxiv. https://doi.org/10.1101/2022.04.11.487885